Search results for "Memory bandwidth"

showing 4 items of 4 documents

Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

2019

Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C st…

020203 distributed computing, Bitonic sorter, Speedup, Computer science, Radix sort, Sorting, Memory bandwidth, 02 engineering and technology, Parallel computing, Bitonic sorting, 020202 computer hardware & architecture, 0202 electrical engineering electronic engineering information engineering, sort, qsort, Merge sort, Branch misprediction, Xeon Phi

2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)

Neighbor-list-free molecular dynamics on Sunway TaihuLight supercomputer

2020

Molecular dynamics (MD) simulations are playing an increasingly important role in many research areas. Pair-wise potentials are widely used in MD simulations of bio-molecules, polymers, and nano-scale materials. Due to a low compute-to-memory-access ratio, their calculation is often bound by memory transfer speeds. Sunway TaihuLight is one of the fastest supercomputers, featuring a custom SW26010 many-core processor. Since the SW26010 has some critical limitations regarding main memory bandwidth and scratchpad memory size, it is considered a good platform to investigate the optimization of pair-wise potentials, especially in terms of data reuse. MD algorithms often use a neighbor-list …

020203 distributed computing, Computer science, 020207 software engineering, Memory bandwidth, 02 engineering and technology, Parallel computing, SW26010, Data structure, Supercomputer, Vectorization (mathematics), 0202 electrical engineering electronic engineering information engineering, Node (circuits), Sunway TaihuLight, Scratchpad memory

Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Massively parallel computation of atmospheric neutrino oscillations on CUDA-enabled accelerators

2019

The computation of neutrino flavor transition amplitudes through inhomogeneous matter is a time-consuming step and thus could benefit from optimization and parallelization. Next to reliable parameter estimation of intrinsic physical quantities such as neutrino masses and mixing angles, these transition amplitudes are important in hypothesis testing of potential extensions of the standard model of elementary particle physics, such as additional neutrino flavors. Hence, fast yet precise implementations are of high importance to research. In the recent past, massively parallel accelerators such as CUDA-enabled GPUs featuring thousands of compute units have been widely adopted due to t…

Computer science, Computation, General Physics and Astronomy, Memory bandwidth, 01 natural sciences, 010305 fluids & plasmas, Standard Model, Computational science, CUDA, Hardware and Architecture, 0103 physical sciences, Neutrino, 010306 general physics, Neutrino oscillation, Massively parallel, Physical quantity

Computer Physics Communications

Accelerated fluctuation analysis by graphic cards and complex pattern formation in financial markets

2009

The compute unified device architecture is an almost conventional programming approach for managing computations on a graphics processing unit (GPU) as a data-parallel computing device. With a maximum number of 240 cores in combination with a high memory bandwidth, a recent GPU offers resources for computational physics. We apply this technology to methods of fluctuation analysis, which includes determination of the scaling behavior of a stochastic process and the equilibrium autocorrelation function. Additionally, the recently introduced pattern formation conformity (Preis T et al 2008 Europhys. Lett. 82 68005), which quantifies pattern-based complex short-time correlations of a time serie…

Physics, Floating point, Series (mathematics), Stochastic process, Autocorrelation, Graphics processing unit, General Physics and Astronomy, Memory bandwidth, Central processing unit, Scaling, Computational science

New Journal of Physics
